Informedia @ TRECVID 2011: Multimedia Event Detection, Semantic Indexing
Abstract
We report on our results in the TRECVID 2011 Multimedia Event Detection (MED) and Semantic Indexing (SIN) tasks. Both tasks consist of three main steps: feature extraction, detector training and fusion. In the feature extraction step, we extracted a large set of low-level, high-level and text features. We used the Spatial Pyramid Matching technique to represent low-level visual local features, such as SIFT and MoSIFT, so that the location information of the feature points is preserved. In the detector training step, besides the traditional SVM, we proposed a Sequential Boosting SVM classifier to deal with the large-scale unbalanced classification problem. In the fusion step, to take advantage of the different features, we tried three fusion methods: early fusion, late fusion and double fusion, where double fusion is a combination of early fusion and late fusion. The experimental results demonstrate that double fusion is consistently better than, or at least comparable to, early fusion and late fusion.

1 Multimedia Event Detection (MED)

1.1 Feature Extraction

To cover as many aspects of a video as possible, we extracted a wide variety of visual and audio features, summarized in Table 1.

Table 1: Features used for the MED task.

                      Visual Features                       Audio Features
Low-level Features    SIFT [19], Color SIFT [19],           Mel-Frequency Cepstral Coefficients
                      Transformed Color Histogram [19],
                      Motion SIFT [3], STIP [9]
High-level Features   PittPatt Face Detection [12],         Acoustic Scene Analysis
                      Semantic Indexing Concepts [15]
Text Features         Optical Character Recognition         Automatic Speech Recognition

1.1.1 SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH)

These three features describe the gradient and color information of a static image. We used the Harris-Laplace detector for corner detection; see [19] for details. Instead of extracting features from all frames of all videos, we first run shot-boundary detection and extract features only from the keyframe of each shot. The shot-boundary detection algorithm computes the color-histogram difference between adjacent frames and declares a shot boundary when the difference exceeds a threshold. For the 16,507 training videos we extracted 572,881 keyframes, and for the 32,061 testing videos we extracted 1,035,412 keyframes.

Given the keyframes, we extract the three features as in [19]. From the raw feature files, a 4096-word codebook is built with the K-means clustering algorithm. Using this codebook, any region of an image can be represented as a 4096-dimensional bag-of-words vector. Following the Spatial Pyramid Matching technique [10], we split each keyframe into 8 regions and compute a bag-of-words vector for each region, which yields an 8 × 4096 = 32768-dimensional vector. The 8 regions are obtained as follows.
• The whole image as one region.
• The image split into 4 quadrants, each quadrant being a region.
• The image split horizontally into 3 equally sized rectangles, each rectangle being a region.
Since these vectors describe individual keyframes while a video is described by many keyframes, we represent a whole video by averaging the feature vectors of its keyframes. The resulting features are then fed to a classifier.
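As an illustration of the shot-boundary step above, the following is a minimal sketch in Python using OpenCV. The histogram bin count, the L1 distance measure and the threshold value are assumptions (the paper does not specify them), and the first frame of each detected shot stands in for its keyframe.

```python
# A minimal sketch of the color-histogram shot-boundary detector.
import cv2
import numpy as np

def color_histogram(frame, bins=8):
    """Concatenated per-channel color histogram, L1-normalized."""
    hists = [cv2.calcHist([frame], [c], None, [bins], [0, 256]).ravel()
             for c in range(3)]
    h = np.concatenate(hists)
    return h / h.sum()

def shot_keyframes(video_path, threshold=0.4):
    """Indices of the first frame of each detected shot."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = color_histogram(frame)
        # Declare a shot boundary when the color-histogram difference
        # between adjacent frames exceeds the threshold.
        if prev is None or np.abs(hist - prev).sum() > threshold:
            keyframes.append(idx)
        prev, idx = hist, idx + 1
    cap.release()
    return keyframes
```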
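The 8-region spatial-pyramid bag-of-words construction and the keyframe-averaging step can be sketched as follows. Here `codebook` is assumed to be the (4096, d) matrix of K-means centers, descriptor extraction itself is omitted, and the nearest-codeword assignment is brute force for clarity.

```python
# A sketch of the 8-region SPM bag-of-words representation and the
# keyframe-averaging step, under the assumptions stated above.
import numpy as np
from scipy.spatial.distance import cdist

def regions(width, height):
    """The 8 SPM regions: whole image, 4 quadrants, 3 horizontal stripes."""
    w2, h2, h3 = width / 2, height / 2, height / 3
    return ([(0.0, 0.0, float(width), float(height))] +
            [(x, y, w2, h2) for y in (0, h2) for x in (0, w2)] +
            [(0.0, i * h3, float(width), h3) for i in range(3)])

def keyframe_vector(points, descriptors, codebook, width, height):
    """32768-dim vector: a 4096-bin word histogram for each of 8 regions."""
    # Quantize each descriptor to its nearest codeword.
    words = cdist(descriptors, codebook).argmin(axis=1)
    hists = []
    for rx, ry, rw, rh in regions(width, height):
        inside = ((points[:, 0] >= rx) & (points[:, 0] < rx + rw) &
                  (points[:, 1] >= ry) & (points[:, 1] < ry + rh))
        hists.append(np.bincount(words[inside], minlength=len(codebook)))
    return np.concatenate(hists).astype(float)

def video_vector(keyframe_vectors):
    """A video is represented by the average of its keyframe vectors."""
    return np.mean(keyframe_vectors, axis=0)
```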
1.1.2 Motion SIFT (MoSIFT)

Motion SIFT [3] is a motion-based feature that combines information from SIFT and optical flow. The algorithm first extracts SIFT points and, for each SIFT point, checks whether there is sufficiently large optical flow near the point. If the optical-flow magnitude is larger than a threshold, a 256-dimensional descriptor is computed for that point: the first 128 dimensions are the SIFT descriptor, and the remaining 128 dimensions describe the optical flow near the point. We extracted Motion SIFT by calculating the optical flow between neighboring frames but, due to speed constraints, we only extracted it on every third frame. Once we have the raw features, a 4096-word codebook is computed and, using the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.3 Space-Time Interest Points (STIP)

Space-Time Interest Points are computed as in [9]. Given the raw features, a 4096-word codebook is computed and, using the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.4 Semantic Indexing (SIN)

We predicted the 346 semantic concepts from the Semantic Indexing 2011 task on the MED keyframes; for details on how the models for the 346 concepts were trained, please refer to Section 2. Given the prediction scores of each concept on each keyframe, we compute a 346-dimensional feature that represents a video, where the value of each dimension is the mean of that concept's prediction scores over all keyframes of the video. We tried different score-pooling techniques, including mean and max, and mean had the best performance. These features are then fed to a classifier.
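A minimal sketch of this score pooling, assuming `scores` is a (num_keyframes, 346) array holding each concept detector's prediction score on each keyframe of one video:

```python
# Pool per-keyframe concept scores into one 346-dim video feature.
import numpy as np

def sin_feature(scores, pooling="mean"):
    """Pool the (num_keyframes, 346) score matrix along the keyframe axis."""
    if pooling == "mean":   # mean pooling performed best in our experiments
        return scores.mean(axis=0)
    if pooling == "max":    # max pooling was the main alternative tried
        return scores.max(axis=0)
    raise ValueError("unknown pooling: " + pooling)
```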